class: center, middle, inverse, title-slide

.title[
# Inference with linear models
]
.subtitle[
## Lecture 4
]
.author[
### Manuel Villarreal
]
.date[
### 08/28/24
]

---
### Recap

- In the last lecture we saw how to add interaction terms to a linear model in R using the `lm()` function.

--

- A visual inspection of the residuals using histograms, scatter plots, and auto-correlation plots showed us that the residuals for this model behaved in a way that agreed with our original assumptions.

--

- Mean close to 0, constant variance, low auto-correlation.

--

- However, the model that does not include the interactions is easier to interpret, and on average its residuals seemed good enough.

--

- How can we decide whether adding an interaction, or any covariate for that matter, is supported by our data?

---
### Inference

- The problem of selecting which covariates to add to a model, or which model to choose from a pool of candidates, is known as an **inference** problem.

--

- There are multiple approaches that can be used to find a solution, including:
  - Null Hypothesis Testing.
  - Model comparison.
  - Forward/backward selection algorithms (please never use these!).

--

- **And most importantly: theoretical constraints.** That is, we choose the covariates that are most relevant to our scientific question, or that are relevant according to a theory that we want to evaluate.

---
### One parameter NHT

- Going back to our blood pressure example, let's say that we have a linear model of the expected blood pressure of a participant as a function of age:
`$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i$$`
and we want to test the hypothesis that the expected blood pressure of a participant increases as a function of age.

--

- The first thing that we need to do is to define our hypothesis mathematically.
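---
### One parameter NHT in R

As a minimal sketch of this setup, we can fit the model with `lm()`. The data here are simulated, and the variable names and coefficient values are made up for illustration (this is not the dataset from the lecture):

```r
# Simulate data consistent with the model
# blood_pressure_i = beta0 + beta1 * age_i + eps_i
set.seed(1)
n   <- 100
age <- runif(n, min = 20, max = 80)
bp  <- 110 + 0.5 * age + rnorm(n, mean = 0, sd = 10)

# Fit the simple linear regression with OLS
fit <- lm(bp ~ age)
coef(fit)  # estimates of beta0 (intercept) and beta1 (slope for age)
```

The slope estimate `\(\hat\beta_1\)` is the quantity our hypotheses about `\(\beta_1\)` will be evaluated against.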
---
### Alternative hypothesis

- The statement "a participant's blood pressure increases as a function of age" can be expressed mathematically as:
`$$H_a:\beta_1 > 0$$`
this means that we expect the value of the parameter `\(\beta_1\)` to be greater than 0. In other words, we expect the regression line to have a positive slope.

--

- The hypothesis under study is also known as the **Alternative** hypothesis.

--

- As we will see next, the Alternative hypothesis is only used to specify the direction or type of test we will do. However, our conclusions will always refer to what we call the **Null** hypothesis.

---
### Null Hypothesis

- The **Null** can be seen as the complement of the Alternative hypothesis. In other words, once the **Alternative** has been defined, all other possible values of a parameter or a coefficient will be assigned to the null.

--

- For example, with our Alternative hypothesis "a participant's blood pressure increases as a function of age", or `\(H_a:\beta_1 > 0\)`, the **Null** would be specified as:
`$$H_0:\beta_1 \leq 0$$`

--

- And as we can see, the union of our two hypotheses is equal to all the possible values that our parameter `\(\beta_1\)` could take.

---
### The Null Hypothesis

- **Notice** that a Null hypothesis does not necessarily state that the parameter or coefficient will be exactly equal to `\(0\)`, but rather that it takes on all other possible values outside of the hypothesis we wish to test.

--

- Additionally, notice that the **Null** and **Alternative** hypotheses are always mutually exclusive. This is important: otherwise, there would be values of the parameter for which we would not be able to decide whether we can reject the null or not.

--

- The **Null** hypothesis plays a central role in frequentist (classical) inference.

--

- As we will see, our conclusions will always be tied to the **Null** hypothesis and our ability to reject it, or our failure to do so.
--

- In comparison, the **Alternative** hypothesis will only be used to determine the type of statistical test that we need to do.

---
### Null Hypothesis Testing

- Let's look back at our simple linear regression model:
`$$\text{blood pressure}_i = \beta_0 + \beta_1\text{age}_i + \epsilon_i$$`

- And change our original hypothesis and say that we wish to test whether "a participant's blood pressure changes as a function of age".

--

- Notice that we replaced the word "**increases**" with the word "**changes**". This is now a different hypothesis and we need to define it formally.

--

- We can formally state our alternative and null hypotheses as follows:
`$$H_0: \beta_1 = 0 \quad H_a: \beta_1 \neq 0$$`

--

- **Notice** that the hypotheses are still mutually exclusive.

---
### Null Hypothesis Testing

- Our new hypothesis just states that age will have an effect on blood pressure; however, we are not sure as to which direction that change will take.

--

- In other words, the change could be

--

  - Positive: as age increases so does blood pressure.

--

  - Negative: as age increases blood pressure decreases.

--

- When the **Null** hypothesis takes a single value we refer to it as a **Point Null**.

--

- As we will see, the type of test we have depends on how we specify our alternative hypothesis.

---
### Null Hypothesis Testing

- Notice that both our **Null** and **Alternative** hypotheses are expressed in terms of the parameter `\(\beta_1\)`.

--

- However, we don't have access to that value. The only thing we have is an estimator, or statistic, for that value given that we have a sample of the population.

--

- Thus, in order to test our hypothesis, we need to construct some type of statistic (a function of the sample) that depends only on what we know, which is our estimator `\(\hat\beta_1\)`, and does not depend on the value of the parameter `\(\beta_1\)`.

--

- To do this we will make use of some of the results that we talked about in a previous lecture.
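---
### NHT with `lm()`

Both kinds of hypotheses can be tested from one fitted model. A hypothetical sketch with simulated data (the variable names and numbers are made up for illustration, not the dataset from the lecture):

```r
# Simulate data and fit the simple linear regression
set.seed(1)
n   <- 100
age <- runif(n, min = 20, max = 80)
bp  <- 110 + 0.5 * age + rnorm(n, mean = 0, sd = 10)
fit <- lm(bp ~ age)

# Two-sided test of the point null H0: beta1 = 0 vs Ha: beta1 != 0.
# Each row reports: Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$coefficients

# One-sided test of H0: beta1 <= 0 vs Ha: beta1 > 0: same statistic,
# but only the upper tail counts as "more extreme"
t_stat <- summary(fit)$coefficients["age", "t value"]
pt(t_stat, df = fit$df.residual, lower.tail = FALSE)
```

When the estimate is positive, the one-sided p-value is half of the two-sided `Pr(>|t|)` that `summary()` reports.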
---
### Building a statistical test

- Remember that we said that the linear model has 4 main assumptions:
  1. Errors have expectation 0.
  1. Errors have constant variance.
  1. Errors are independent.
  1. Errors follow the same distribution.

--

- We mentioned that if those assumptions hold true, then we know by the **Central Limit Theorem** that our estimators found using the Ordinary Least Squares (**OLS**) method approximately follow a normal distribution:
`$$\hat{\beta}_0 \overset{\bullet}{\sim} N\left(\beta_0,\ \frac{\sigma^2\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2}\right)\quad,\quad \hat{\beta}_1 \overset{\bullet}{\sim} N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$`

---
### Building a statistical test

- In general, just to have some notation, we say that:
`$$\hat{\beta}_j \overset{\bullet}{\sim} N\left(\beta_j,\ \mathrm{Var}(\hat{\beta}_j)\right)$$`

--

- Let's assume that the simple linear regression model we have meets these assumptions, which means that
`$$\hat{\beta}_1 \overset{\bullet}{\sim} N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$`

--

- In this new relation we have the two values that we need to construct our statistic in order to make a test.

---
### Building a statistical test

- Because we know that `\(\hat{\beta}_1\)` is approximately normal, that means that:
`$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} \overset{\bullet}{\sim} N\left(0,\ 1\right)$$`

--

- This is very similar to a z-score and is just a result from probability theory.

--

- Now, if the **Null** hypothesis were true, that would mean that `\(\beta_1 = 0\)`.

--

- This is an important step, and is the reason why all our conclusions have to be stated in terms of the **Null** hypothesis and not the alternative.
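---
### Checking the sampling distribution

The distributional claim above can be checked by simulation: if we draw many samples from a known model, the OLS slope estimates should spread out with variance `\(\sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2\)`. A hypothetical sketch (all values made up for illustration):

```r
# Simulate many replications from a model with known beta1 = 0.5 and
# known error sd, keeping the design (age) fixed across replications
set.seed(1)
n     <- 100
sigma <- 10
age   <- runif(n, min = 20, max = 80)

slopes <- replicate(2000, {
  bp <- 110 + 0.5 * age + rnorm(n, mean = 0, sd = sigma)
  coef(lm(bp ~ age))[["age"]]
})

# Theoretical standard deviation of beta1-hat vs. the simulated one
sqrt(sigma^2 / sum((age - mean(age))^2))
sd(slopes)
```

The two standard deviations should be close, and a histogram of `slopes` should look approximately normal and centered at the true `\(\beta_1\)`.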
---
### Building a statistical test

- We can write our new variable as:
`$$\frac{\hat{\beta}_1}{\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} \overset{\bullet}{\sim} N\left(0,\ 1\right)$$`

--

- Nothing in our equation has changed; however, we replaced the value of `\(\beta_1\)` with its value when the **Null** hypothesis is true.

--

- This new statement no longer contains the value of the parameter `\(\beta_1\)` and only depends on the value of `\(\hat\beta_1\)`, which is what we wanted.

--

- This means that we could use this new quantity to test our hypothesis and be done!...

--

Well, not so fast.

---
### Building a statistical test

`$$\frac{\hat{\beta}_1}{\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} \overset{\bullet}{\sim} N\left(0,\ 1\right)$$`

- Notice that although we did get rid of the parameter `\(\beta_1\)`, we introduced a new unknown to our equation, which is the value of `\(\sigma^2\)`, the variance of the errors.

--

- This means that we need to find a statistic, or a function of the sample, to replace it.

--

- If you remember, when we started working with OLS estimators for the simple linear regression, we also calculated the value of `\(\hat\sigma^2\)`.

--

- We can use that value this time in order to replace `\(\sigma^2\)` and not have any unknown values in our equation!

---
### Building a statistical test

`$$\frac{\hat{\beta}_1}{\sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} \overset{\bullet}{\nsim} N\left(0,\ 1\right)$$`

- The good news is that we don't have any more unknown values in our equation! We are almost done.

--

- The bad news is that when we replace `\(\sigma^2\)` with its unbiased estimator `\(\hat\sigma^2\)`, we introduce a new problem: the distribution of this new quantity is no longer **Normal**.

--

- Thankfully, statisticians have found what distribution this new thing follows!
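---
### Building the statistic by hand

The pieces above can be assembled by hand and compared with what `lm()` computes internally. A hypothetical sketch with simulated data (not the dataset from the lecture):

```r
# Simulate data and fit the simple linear regression
set.seed(1)
n   <- 100
age <- runif(n, min = 20, max = 80)
bp  <- 110 + 0.5 * age + rnorm(n, mean = 0, sd = 10)
fit <- lm(bp ~ age)

beta1_hat  <- coef(fit)[["age"]]
sigma2_hat <- sum(resid(fit)^2) / (n - 2)  # unbiased estimator of sigma^2
se_beta1   <- sqrt(sigma2_hat / sum((age - mean(age))^2))

# The statistic from the slide: estimate divided by its standard error
t_stat <- beta1_hat / se_beta1
```

This hand-built `t_stat` matches the "t value" that `summary(fit)` reports for `age`.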
---
### Building a statistical test

`$$\frac{\hat{\beta}_1}{\sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}} \overset{\bullet}{\sim} T_{n-(p+1)}$$`

- This new quantity follows a T distribution with `\(n-(p+1)\)` degrees of freedom, where `\(p\)` represents the number of covariates in our linear model.

--

- For example, in our simple linear regression model `\(p=1\)`, which is age, and therefore the degrees of freedom would be `\(n-2\)`.

--

- This is why it is called a T test!

--

- Why is this equation useful? Well, we can use it to calculate some probabilities.

---
### P-values

- Now that we have our new statistic and know its distribution **when the Null hypothesis is true**, we can use this knowledge to calculate the probability of obtaining the value that we actually found, or something more extreme.

--

- This is by definition a **p-value**. A **p-value** is the probability of finding the value we found or something more extreme **when the null hypothesis is true**.

--

- **Notice** that we emphasize that this is all true **only** when the Null hypothesis is true.

--

- Formally, we can express the p-value as:
`$$P\left(\lvert t \rvert \geq \left\lvert\frac{\hat{\beta}_1}{\sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}}\right\rvert\right)$$`
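---
### P-values in R

The probability above can be computed with the T distribution function `pt()`. A hypothetical sketch with simulated data (not the dataset from the lecture):

```r
# Simulate data and fit the simple linear regression
set.seed(1)
n   <- 100
age <- runif(n, min = 20, max = 80)
bp  <- 110 + 0.5 * age + rnorm(n, mean = 0, sd = 10)
fit <- lm(bp ~ age)

t_stat <- summary(fit)$coefficients["age", "t value"]

# P(|t| >= |t_stat|) under a T distribution with n - 2 degrees of freedom:
# twice the upper-tail probability, by the symmetry of the T distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
```

This is exactly the `Pr(>|t|)` column that `summary(fit)` reports.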